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Abstract 



Let Xi , . . . , X n be independent and identically distributed random vectors with a log-concave 
(Lebesgue) density /. We first prove that, with probability one, there exists a unique maximum 
likelihood estimator /„ of /. The use of this estimator is attractive because, unlike kernel density 
estimation, the method is fully automatic, with no smoothing parameters to choose. Although 
the existence proof is non-constructive, we are able to reformulate the issue of computing /„ 
in terms of a non-differentiable convex optimisation problem, and thus combine techniques of 
computational geometry with Shor's r-algorithm to produce a sequence that converges to /„. 
For the moderate or large sample sizes in our simulations, the maximum likelihood estimator is 
shown to provide an improvement in performance compared with kernel-based methods, even 
when we allow the use of a theoretical, optimal fixed bandwidth for the kernel estimator that 
would not be available in practice. We also present a real data clustering example, which shows 
that our methodology can be used in conjunction with the Expectation-Maximisation (EM) 
algorithm to fit finite mixtures of log-concave densities. An R version of the algorithm is available 
in the package LogConcDEAD - Log-Concave Density Estimation in Arbitrary Dimensions. 

Keywords: Computational geometry, log-concavity, maximum likelihood estimation, non- 
differentiable convex optimisation, nonparametric density estimation, Shor's r-algorithm 



1 Introduction 



Modern nonparametric d ensity estimation began with the introduction of a kernel densi t y estim ator 
in the pioneering work of iFix and Hodeesl (|l95ll ). later republished as lFix and Hodeesl (|l989h . For 
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independent and identically distributed real-value d observations, the app ealing asympt otic theory of 
the mean integrated squared error was provided bv lRosenblattl (jl956h and lParzenl (l962l ). This theory 
leads to an asymptotically optimal choice of the smoothing parameter, or bandwidth. Unfortunately, 
however, it depends on the unknown density / through the integral of the square of the second 
derivative of /. Considerable effort has therefore been focused on finding methods of automatic 
bandwidth selection (cf. Wand and Jones, 199 5, Chapter 3, and the references therein). Although 
this has resulted in algorithms, e.g. IChiu that achieve the optimal rate of convergence of the 

relative error, namely O p (n -1 / 2 ), where n is the sample size, good finite sample performance is by 
no means guaranteed. 



This probl em is compound ed when the observations take values in M. d , 

estimator ( Deheuvelsl . Il977l ) requires the specification of a symmetric, positive definite d x d band- 



where the general kernel 



width matrix. The difficulties involved in making the d(d + l)/2 choices for its entries mean that 
attention is often restricted either to bandwidth matrices that are diagonal, or even to those that are 
scalar multiples of the identity matrix. Of course, practical issues of automatic bandwidth selection 



In this paper, we propose a fully automatic nonparametric estimator of /, with no tuning parameters 
to be chosen, under the condition that / is log-concave - that is, log / is a concave function. The 
class of log-concave densities has many attractive properties and has been well-studied, particularly 
in the economics, sampling and reliability theory literature. See Section [2] for further discussion of 
examples, applications and properties of log-concave densities. 

In Section [31 we show that if X\, . . . , X n are independent and identically distributed random vectors 
with a log-concave density, then with probability one there exists a unique log-concave density f n 
that maximises the likelihood function, 

n 

L(f)=nf(x i ). 

i=l 

Before continuing, it is worth noting that without any shape constraints on the densities under 
consideration, the likelihood function is unbounded. To see this, we could define a sequence (/„) of 
densities that represent successively close approximations to a mixture of n 'spikes' (one on each Xj), 
such as f n {x) — n~ l 5^ i=1 4>d,n- 1 ii x ~ Xi), where denotes the Nd(0, S) density. This sequence 
satisfies L(f n ) — > oo as n — * oo (cf. Figure [TJ. In fact, a modification of this argument may be used 
to show that the likelihood function remains unbounded even if we restrict attention to unimodal 
densities. 

Figure [5] gives a diagram illustrating the structure of the maximum likelihood estimator on the 
logarithmic scale. This structure is most easily visualised for two-dimensional data, where one can 
imagine associating a 'tent pole' with each observation, extending vertically out of the plane. For 
certain tent pole heights, the graph of the logarithm of the maximum likelihood estimator can be 
thought of as the roof of a taut tent stretched over the tent poles. The fact that the logarithm of 
the maximum likelihood estimator is of this 'tent function' form constitutes part of the proof of its 
existence and uniqueness. 

In Section we discuss the computational problem of how to adjust the n tent pole heights so that 
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Figure 1: Without any shape constraint on 
the class of densities, the likelihood function is 
unbounded, because we can take successively 
close approximations to a mixture of n 'spikes' 
(one on each Xi). 




(a) Density 




Figure 2: The 'tent-like' structure of the graph 
of the logarithm of the maximum likelihood es- 
timator for bivariate data. 




(b) Log-density 



Figure 3: Log-concave maximum likelihood estimates based on 1000 observations (plotted as dots) 
from a standard bivariate normal distribution. 



the corresponding tent functions converge to the logarithm of the maximum likelihood estimator. 
One reason that this computational problem is so challenging in more than one dimension is the fact 
that it is difficult to describe the set of tent pole heights that correspond to concave functions. The 
key observation, discussed in Section^ is that it is possible to minimise a modified objective function 
that it is convex (though non-differentiable) . This allows us t o apply the powerful non-differentiable 
convex optimisation methodology of the subg radient method dShorl. 19851) an d a variant called Shor's 
r-algorithm, which has been implemented bv lKappel and Kuntsevich (2000). 



As an illustration of the estimates obtained, Figure [3] presents plots of the maximum likelihood esti- 
mator, and its logarithm, for 1000 observations from a standard bivariat e normal distribution. These 
plots were created using the LogConcDEAD package dCule et aH 2008ah in R ( R Development Core 



Team, 



20081), which exploi ts the interactive surface-plotting software available in the rgl package 



(jAdler and Murdoch! . 120071 ) 



In Section O we present simulations to compare the finite-sample performance of the maximum 
likelihood estimator with kernel-based methods. The results are striking: even when we use the 
theoretical, optimal bandwidth for the kernel estimator (or an asymptotic approximation to this 
when it is not available), we find that the maximum likelihood estimator has a rather smaller 
mean integrated squared error for moderate or large sample sizes, despite the fact that this optimal 
bandwidth depends on properties of the density that would be unknown in practice. This suggests 
that the maximum likelihood estimator is able to adapt to the local smoothness of the underlying 
density automatically. 

Nonparametric density estimation is a fundamental tool for the visualisation of structure in ex- 
ploratory data analysis, and has an enormous literature that includes the monographs of Devroye 
and Gv5rfi (|l985l ). ISilvermanl (|l986l ). IScottJ (|l992h and IWand and Jonesl (|l995h . Our~proposed 
method may certainly be used for this purpose; however, it may also be used as an intermediary 
stage in more involved statistical procedures. For instance: 



In classification problems, we have p > 2 populations of interest, and assume in this discussion 
that these have densities /i, . . . , / p on M. d . We observe training data of the form {{Xi,Yi) : 
i = 1, . . . , rz.}, where if Yi = j, then Xi has density fj. The aim is to classify a new observation 
z G M. d as coming from one of the populations. Problems of this type occur i n a huge variet y 
of applicatio ns, including medical d iagnosis, archaeology, ecology etc. - see Gordon ( 198ll ). 



oi applicatio ns, mciuamg meaical di agnosis, arcnaeoiogy, ecology etc. - see Hjordoni i[ imp il l. 
Hand! (|l98ll ) or lDevrove elall (jl996h for further details and examples. A natural approach to 



classification problems is to construct density estimates f\, . . . ,f p , where fj is based on the 
rij observations, say, from the jth population, namely {A, : Yi = j}. We may then assign 
z to the jth population if rijfj(z) — max{ni/i(z), . . . , n p / p (z)}. In this context, the use of 
kernel-based estimators in general requires the choice of p separate d x d bandwidth matrices, 
while the corresponding procedure based on the log-concave maximum likelihood estimates is 
again fully automatic. 

Clustering problems are closely related to the classification problems described above. The 
difference is that, in the above notation, we do not observe Y\, . . . ,Y n , and have to assign 
each of X\, . . . , X n to one of the p populations. A common technique is based on fitting a 
mixture density of the form f(x) = X)?=i n jfj( x )i where the mixture proportions 7Ti,. ..,ir p 
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are positive and sum to one. Under the assumption that each of the component densities 
fi, . . . , f p is log-concave, we show in Section [6] that our methodology can be extended to fit 
such a finite mixture density, which need not itself be log-concave - cf. Section ® We also 
illustrate this clustering algorithm on a Wisconsin breast cancer data set in Section [HI where 
the aim is to separate observations into benign and malignant component populations. 

3. A functional of the true underlying density may be estimated by the corresponding functional of 
a density estimator, such as the log-concave maximum likelihood estimator. Examples of func- 
tionals of interest include probabilities, such as /ii !B i|> 1 f(%) dx, moments, e.g. J ||j;|| 2 /(a;) dx, 
and the differential entropy, — J f(x) log f(x) dx. It may be possible to compute the plug-in es- 
timator based on the log-concave maximum likelihood estimator analytically, but in Section [71 
we show that even if this is not possible, in many cases of interest we can sample from the 
log-concave maximum likelihood estimator f n , and hence obtain a Monte Carlo estimate of 
the functional. This nice feature also means that the log-concave maximum likelihood estima- 
tor can be used in a Monte Carlo bootstrap procedure for assessing uncertainty in functional 
estimates - see Section [JJ for further details. 

4. The fitting of a nonparametric density estimate may give an indication of the validity of a 
particular smaller model (often parametric). Thus, a contour plot of the log-concave maximum 
likelihood estimator may provide evidence that the underlying density has elliptical contours, 
and thus suggest that a model that exploits this elliptical symmetry. 



In the univariate case, IWaltherl (|2002l ) describes methodology based on log-concave density 
estimation for addressing the problem of detecting the presence of m ixing in a distribution. 
As an application, he cites the Pickering/Platt debate (| Swales . 1985( ) on the issue of whether 



high blood pressure is a disease (in which case observed blood pressure measurements should 
follow a mixture distribution), or simply a label attached to people in the right tail of the 
blood pressure distribution. As a result of our algorithm for computing the multidimensional 
log-concave maximum likelihood estimator, this methodology extends immediately to more 
than one dimension. 



There has been considerable recent interest in shape-restricted nonparametric density estimation, but 
most of it has been confined to the case of univariate densities, where the computational algorithms 
are more straightforward. Nevertheless, as was discussed above, it is in multivariat e situations tha t 
the automatic nature of the m axim um likelihood es timator is particularly valuable. IWaltherl (|2002f ). 
Dumbeen and Rufibachl (|2007l ) and lPal et aJ\ (|2007l ) have proved the existence and uniqu e ness o f the 
log-concave maximum likelihood estimator in one dimension and Diimbgcn and Rufibach 

(200?L Pal 

et al. (|2007l ) and lBalabdaoui et all (|2008l ) have studied its theoretical properties. iRufibachl (|2007r ) has 
compared different a lgorithms for computing the univariate estimator, including the iterative convex 
min orant algorithm dGroeneboom and Wellnerl.ll992l;|jongbloedl.ll998l). and three others. Diimbgen 
et al. (2007) also present an Active Set alg orithm, which has similarit ies with the vertex direction and 
vertex reduction algorithms described in iGroeneboom et all (2008). For univariate data, it is also 
well-know n that there exist maximum likelihood estimators of a non-increasing den s ity su pported 
on [0, oo) ( Grenanderl . Il956h and of a convex, decreasing density ( Groeneboom et al. L l200lh . 



In Section [5J we give a brief concluding discussion, and suggest some directions for future research. 
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Finally, we present in Appendix [X] a glossary of terms and results from convex analysis and com- 
putational geomet ry that appear in italics at their fi rst occurrence in the main body of the paper; 
the references are Rockafellan ( 1997 ) and iLed (1997). Proofs are deferred to Appendix iBl except 
that the beginning of the proof of Theorem [5] is given in the main text, as the ideas and notation 
introduced are needed in the remainder of the paper. 



Log-concave densities: examples, applications and proper- 
ties 



Many of the most commonly-encountered parametric families of univariate distributions have log- 
concave densities, including the family of normal distributions, gamma distributions with shape 
parameter at least one, Beta(a,/3) distributions with a, [3 > 1, W eibull distributions with shape 
parameter at least one, Gumbel, logistic and Laplace densities; see Bagnoli and Bergstrom ( 19891 ) 
for other examples. Univariate log-concave densities are unimodal and have fairly light tails - it may 
help to think of the exponential distribution (where the logarithm of the density is a linear function 
on the positive half-axis) as a borderline case. Thus Cauchy, Pareto and lognormal densities, for 
instance, are not log-concave. Mixtures of log-concave densities may be log-concave, but in general 
they are not; for instance, forp £ (0, 1), the location mixture of standard univariate normal densities 
f(x) =pcj){x) + (1 — p)4>{x — n) is log-concave if and only if \\^\\ < 2. 



The assumption of log-concavity is a popular one in economics; ICanlin and Naelbufti (jl991bl ) show 
that in the theory of elections and under a log-concavity assumption, the proposal most preferred 
by the mean voter is u nbeatable under a 64% majo rity rule. As another example, in the theory of 
imperfect competition, ICaplin and NaelbufB (|1991al ) use log-concavity of the density of consumers' 
utility parameters as a sufficient condition in their proof of the existence of a pure-strategy price 
equilib rium for any number of firms producing any set of prod ucts. S ee Bagnoli and Bergstroml 
(1 19891 ) f or many other applications of log-concavitv to economics. iBrooks (1 19981) and Mengersen and 
Tweedie (|l996f ) have exploited the properties of log-concave densities in studying the convergence 
of Markov chain Monte Carlo sampling procedures. 
□ 



An (|l998f ) lists many useful properties of log-concave densities. For instance, if / and g are (possibly 
multidimensional) log-concave densities, then their convolution / * g is log-concave. In other words, 
if X and Y are independent and have log-concave densities, then their sum X + Y has a log-concave 
density. The class of log-concave densities is also closed under the taking of pointwise limits. One- 
dimensional log-concave densit ies have inc r easing hazard functions, which is why they are of interest 
in reliability theory. Moreover. Ilbragimov ( 19561 ) proved the following characterisation: a univariate 
density / is log-concave if and only if the convolution / * g is unimodal for every unimodal density 
g. There is no natural generalisation of this result to higher dimensions. 

As was mentioned in SectionQ] this paper concerns multidimensional log-concave densities, for which 
fewer properties are known. It is therefore of interest to understand how the property of log-concavity 
in more than one dimension relates to the univariate notion. Our first proposition below is intended 
to give some insight into this issue. It is not formally required for the subsequent development of our 
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methodology in Sections [3] and dj although we did apply the result when designing our simulation 
study in Section [5] We assume throughout that log-concave densities are with respect to Lebesgue 
measure on the affine hull of their support, and 'X has a log-concave density' means 'there exists a 
version of the density of X that is log-concave'. 

Proposition 1. Let X be a d-variate random vector having density f with respect to Lebesgue 
measure on M. d . For a subspace V ofW. d , let Pv(%) denote the orthogonal projection of x onto V. 
Then in order that f be log-concave, it is: 



1. necessary that for any subspace V , the marginal density of Py(X) is log-concave and the 
conditional density fx\p v (x){'\t) of X given Py(X) = t is log-concave for each t 

2. sufficient that for every (d — 1) -dimensional subspace V, the conditional density fx\p v (x){'\t) 
of X given Py(X) = t is log-concave for each t. 



T he part of Propo sitionQJa) concerning marginal densities is an immediate consequence of Theorem 6 
of iPrekopa One can regard Proposition [ljb) as saying that a multidimensional density is 

log-concave if the restriction of the density to any line is a (univariate) log-concave function. 



It is interesting to compare the properties of log-concave densities presented in Proposition [T] with 
the corresponding properties of Gaussian densities. In fact, Proposition [T] remains true if we replace 
'log-concave' with 'Gaussian' throughout (at least, provided that in part (b) we also assume there is a 
point at which / is twice diffcrentiable) . These shared properties suggest that the class of log-concave 
densities is a natural, infinite-dimensional generalisation of the class of Gaussian densities. 



3 Existence, uniqueness and structure of the maximum like- 
lihood estimator 



Let !Fq denote the class of log-concave densities on K d with d- dimensional support, and let fo G T§. 
The degenerate case where the support is of dimension smaller than d can also be handled, but for 
simplicity of exposition we concentrate on the non-degenerate case. Suppose that X\, . . . ,X n are 
a random sample from Jo- We say that f n = f n (X\, . . . ,X n ) G J-q is a (nonparametric) maximum 
likelihood estimator of fo if it maximises 1(f) = ^"=1 l°S/(-^») over / G Tq. 

Theorem 2. Suppose that n > d + 1. Then, with probability one, a nonparametric maximum 
likelihood estimator f n of fo exists and is unique. 



First Part of Proof. We may assume that X\, . . . , X n are distinct and their convex hull, 
C n = conv(Xi, . . . , X n ), is a d-dimensional polytove (an even t of probability one when n > d + 1). 
By a standard argument in convex analysis (|Rockafellarl . 119971 . p. 37), for each y = . . . , y n ) G M. n 



there exists a function h y 
h y (Xi ) > yi for alH = 1 , 



-> K with the property that h y is the least concave function satisfying 
Informally, h y is a 'tent function', and a typical example is depicted 
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in Figure [21 Let TL = {h y : y S R"} denote 'the class of tent functions'. Let T denote the set of all 
log-concave functions on R d , and for / € J 7 , define 



1 ™ f 
Mf) = ~ y>g /(*()- / f(x)dx. 



Suppose that / maximises ipn(') over T. The main part of the proof, which is completed in the 
Appendix, consists of showing that 

(i) f(x) > for x 6 C n 

(ii) f(x) = for x i C n 

(iii) log fen 

(iv) feT a 

(v) there exists M > such that if max^ > M, then ip n (exp(h y )) < ip n (f)- 

Although step (iii) above gives us a finite-dimensional class of functions to which log /„ belongs, the 
proof of Theorem [2] gives no indication of how to find the member of this class that maximises the 
likelihood function. We therefore seek an iterative algorithm to compute the estimator, but first we 
describe the structure we see in Figure [2] in Section [1] more precisely. From now on, we assume: 

(Al): n > d+ 1, and every subset of {Xi, . . . , X n } of size d + 1 is affinely independent. 

Note that when n > d + 1, the event in (Al) has probability one. From step (iii) in the proof of 
Theorem [5] above, there exists y £ R™ such that log / ra = h v . As i llustr ated in Figure^ and justified 
formally by Corollary 17.1.3 and Corollary 19.1.2 of lRockafellarl (jl997h . the convex hull of the data, 



C n , may be triangulated in such a way that log /„ coincides with an affine function on each simplex 
in the triangulation. In other words, if j = (Ji, . . . ,jd+i) is a (d + l)-tuple of distinct indices in 
{1, . . . ,rt}, and C n j = conv(Xj 17 . . . ,Xj d+1 ), then there exists a finite set J consisting of m such 
(d + l)-tuples, with the following three properties: 

(i) Uj£jC n j = C n 

(ii) the relative interiors of the sets {C raj : j £ J} are pairwise disjoint 
(iii) 

(x, bj) — f3j if x G C n j for some j £ J 



log/„(x) 



x if x 4. C n 



for some 6 1; . . . , b m 6 R d and /?!,..., [3 m el. Here and below, (•, •} denotes the usual Euclidean 
inner product in M. d . 

In the iterative algorithm that we propose in Section [4] for computing the maximum likelihood 
estimator, we need to find convex hulls and triangulations at each itera t ion. F ortunately, these can 
be computed efficiently using the Quickhull algorithm o f lBarber et all (|l99ffl . 
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4 Computation of the maximum likelihood estimator 
4.1 Reformulation 

As a first attempt to find an algorithm which produces a sequence that converges to the maximum 
likelihood estimator in Theorem^ it is natural to try to minimise numerically the function 



Although this approach might work in principle, one difficulty is that r is not convex, so this approach 
is extremely computationally intensive, even with relatively few observations. Another reason for 
the numerical difficulties stems from the fact that the set of y- values on which r attains its minimum 
is rather large: in general it may be possible to alter particular components yi without changing h y . 
Of course, we could have defined r as a function of h y rather than as a function of the vector of tent 
pole heights y = (y\, . . . ,y n )- Our choice, however, motivates the following definition of a modified 
objective function: 



The great advantages of minimising a rather than r are seen by the following theorem. 

Theorem 3. Assume (Al). The function a is a convex function satisfying a >t. It has a unique 
minimum at y* £ M™, say, and log/ n = h y * . 

Thus Theorem [3] shows that the unique minimum y* — (y^, ...,?/*) of a belongs to the minimum 
set of t. In fact, it corresponds to the element of the minimum set for which h y *(Xi) = y* for 
i = 1, . . . , n. Informally, then, h y * is 'a tent function with all of the tent poles touching the tent'. 

In order to compute the function a at a generic point y = (yx, . . . ,y n ) £ R™, we need to be able to 
evaluate the integral in (|4.1[) . In the notation of Section [3J we may write 



For each j = (ji, . . . , jd+i) £ J, let Aj be the d x d matrix whose Ith column is Xj l+1 — Xj t for 
I = 1, . . . , g?, and let aj = Xj ± . Then the affine transformation w i— > AjW + aj takes the unit simplex 
T d = {w = (wi,...,Wd) ■ wi > 0,Y,i=i w l < 1} to Letting Zjj = y Jl+1 - y^, we can then 

establish by a simple change of variables and induction on d that if Zj i, . . . , Zj^ are non-zero and 
distinct, then 





(4.1) 





(4.2) 
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Further details of this calculation can be found in a longer version of this paper (jCule et all l2008bl ) . 
The singularities that occur when some of Zj t i, . . . , may be zero or equal are removable 
although (|4.2p is a little complicated, it allows the computation of our objective function. 



Thus, 



4.2 Nonsmooth optimisation 



There is a vast literature on techniques of convex optimisation (cf. iBovd and Vandenberghei ( 2004 ), 



for example), including the method of steepest descent and Newton's method. Unfortunately, these 
methods rely on the differentiability of the objective function, and the function a is not differentiable. 
This can be seen informally by studying the schematic diagram in Figure [5] again. If the ith. tent 
pole, say, is touching but not critically supporting the tent, then decreasing the height of this tent 
pole does not change the tent function, and thus does not alter the integral in (|4.1[) ; on the other 
hand, increasing the height of the tent pole does alter the tent function and therefore the integral 
in (|4.ip . This argument may be used to show that at such a point, the ith partial derivative of a 
does not exist. 

The set of points at which a is not differentiable constitute a set of Lebesgue measure zero, but the 
non-differentiability cannot be ignored in our optimisation procedure. Instead, it will be necessary to 
derive a subgradient of a at each point y G R". This derivation, along with a more formal discussion 
of the non-diffcrentiability of a, can be found in the Appendix. 

The theory of non-differentiable, convex optimisation is perhaps less well-kn own than its differen- 
tiable counterpart, but a fundamental contribution was made bv lShor (1985) with his introduction 



of the subgradient method for minimising non-differentiable, convex functions defined on Euclidean 
spaces. A slightly specialised version of his Theorem 2.2 gives that if da(y) is a subgradient of a at 
y, then for any £ W 1 , the sequence generated by the formula 

y V ni+1 \\da(yW)\\ 

has the property that either there exists an index I* such that > — y*, or y^' — > y* and 
a(y^) — > a(y*) as I — > oo, provided we choose the step lengths hg so that hg — > as I — > oo, but 

XXi he = oo. 

Shor recognised, however, that the convergence of this algorithm could be slow in practice, and that 
although appropriate step size selection could improve matters somewhat, the convergence would 
never be bette r than linear (compared with qu adratic convergence for Newton's method near the 
optimum - see Bovd and Vandenberghei (2004, Section 9.5)). Slow convergence can be caused by 



taking at each stage a step in a direction nearly orthogonal to the direction towards the optimum, 
which means that simply adjusting the step size selection scheme will never produce the desired 
improvements in convergence rate. 



One solution (jShorl . 1 19851 . Chapter 3) is to attempt to shrink the angle between the subgradient and 
the direction towards the minimum through a (necessarily nonorthogonal) linear transformation, 
and perform the subgradient step in the transformed space. By analogy with Newton's method for 
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smooth functions, an appropriate transformation would be an approximation to the inverse of the 
Hessian matrix at the optimum. This is not possible for nonsmooth problems, because the inverse 
might not even exist (and will not exist at points at which the function is not differentiable, which 
may include the optimum). 

Instead, we perform a sequence of dilations in the direction of the difference between two succes- 
sive subgradients, in the hope of improving convergence in the worst-case scenario of steps nearly 
perpendicular to the direction towards the minimiser. This variant, whi ch has become known as 
Shor's r-algorithm, has been implemented in Kappel and Kuntsevich ( 2000h . Accompanying software 



SolvOpt is available from http://www.uni-graz.at/imawww/kuntsevich/solvopt/. 

Although the formal convergence of the r-algorithm has not been proved, we agree with the au- 
thors' claims that it is robust, efficient and accurate. Of course, it is clear that if we terminate 
the r-algorithm after any finite number of steps and apply the original Shor algorithm using our 
terminating value of y as the new starting value, then formal convergence is guaranteed. We have 
not found it necessary to run the original Shor algorithm after termination of the r-algorithm in 
practice. 

If (y^) denotes the sequence of vectors in R" produced by the r-algorithm, we terminate when 



\a(y(^)-a(y^)\<6 
| y f +1 >- y W|<eforz = l,. 
|1 — J exp{h y (e) (x)} dx\ < r\ 



for so me small S, e and r\ > 0. The first two termination criteria follow Kappel and Kuntsevich] 



(l200(t l. while the third is based on our knowledge that the true optimum corresponds to a density 
(Section [3]). As default values, and throughout this paper, we took 5 = 1CT 8 and e = r) = 1(T 4 . 

Table [T] gives approximate running times and number of iterations of Shor's r-algorithm required 
for different sample sizes and dimensions on an ordinary desktop computer (1.8GHz, 2GB RAM). 
Unsurprisingly, the running time increases relatively quickly with the sample size, while the number 
of iterations increases approximately linearly with n. Each iteration takes longer as the dimension 
increases, though it is interesting to note that the number of iterations required for the algorithm 
to terminate decreases as the dim ension increases. When d = 1, we recommend the Active Set 
algorithm of Dumbgen et ~d\ (1200 <tl. which is implemented in the R package logcondens fRufibach 
and Dumbgen 



Jumbgcr 
,|2QQ6|). 



5 Finite sample performance 

Our simulation study considered, for d = 2 and 3, the following densities: 
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Table 1: Approximate running times (with number of iterations in brackets) for computing the 

maximum likelihood estimator of a log-concave density 

n = 100 n = 500 n = 1000 n = 2000 

d = 2 1.5 sees (260) 50 sees (1270) 4 mins (2540) 24 mins (5370) 
d = 3 6 sees (170) 100 sees (820) 7 mins (1530) 44 mins (2740) 
d = 4 23 sees (135) 670 sees (600) 37 mins (1100) 224 mins (2060) 



(a) standard normal, 4>d = 4>d,i 

(b) dependent normal, 0d,E, with Sjj = l{i=j} +0.21^^} 

(c) the joint density of independent T(2, 1) components 

(d-f) the normal location mixture 0.6</> d (-) + 0A(j) d (- - for (d) = 1, (e) \\pL\\ = 2, (f) ||/_t|| = 3. 
An application of Proposition!!] gives that such a normal location mixture is log-concave if and 
only if || ^ II < 2. 

In Tables [2] and we present, for each density and for four different sample sizes, an estimate of the 
mean integrated squared error (MISE) of the nonparametric maximum likelihood estimator based 
on 100 Monte Carlo iterations. We also show the MISE for the kernel density estimates with a 
Gaussian kernel and, for all of the normal and mixture of normal examples, the choice of bandwidth 
that minimises the MISE. In the gamma example, exact MISE calculations are not possible, so we 
took the bandwidth that minimises the asymptotic mean inte grated squared error (A MISE). These 
optimal bandwidths can be computed using the formulae in IWand and Joned (|l995l . Sections 4.3 



and 4.4). As minimisation of the expressions for both the MISE and the AMISE requires knowledge 
of certain functionals of the true density that would be unknown in practice, we also provide a 
comparison with an em pirical bandwidth selector based on least squares cross validation (LSCV) 
and and Jonesl . ll995l . Section 4.7). The LSCV bandwidths were computed using the ks package 



Duongl . 120071 ) in R, and we used the option of constraining the bandwidth matrices to be diagonal 



in cases (a) and (c) where the components are independent. 

We see that in cases (a)-(e) the log-concave maximum likelihood estimator has a smaller MISE 
than the kernel estimate with bandwidth chosen by LSCV, and at least for moderate and large 
sample sizes, the difference is quite dramatic. Even more remarkably, in these cases the log-concave 
estimator also outperforms the kernel estimate with optimally chosen bandwidth when the sample 
size is not too small. It seems that for small sample sizes, the fact that the convex hull of the data 
is rather small hinders the performance of the log-concave estimator, but that this effect is reduced 
as the sample size increases. The log-concave estimator copes well with the dependence in case (b), 
and it also deals particularly impressively with case (c), where the true density decays to zero at the 
boundary of the positive orthant. 

In case (f), where the log-concavity assumption is violated, the performance of our estimator is not 
as good as the kernel estimate with the optimally chosen bandwidth, but is still comparable in most 
cases with the LSCV method. One would not expect the MISE of /„ to approach zero as n — > oo 
if log-concavity is violated, and in fact we conjecture that in this case the log-concave maximum 
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Table 2: Mean integrated squared error estimates (with standard errors in brackets where applicable; 
d= 2) 

(a) Independent Normal 



n 


LogConcDEAD 


Kernel (opt MISE) 


Kernel (LSCV) 


100 


0.00620(0.000222) 


0.00431 


0.00622(0.000383) 


500 


0.00161(0.0000514) 


0.00164 


0.00199(0.0000844) 


1000 


0.000983(0.0000289) 


0.00106 


0.00122(0.0000495) 


2000 


0.000599(0.0000155) 


0.000686 


0.000803(0.0000276) 






(b) Dependent Normal 




n 


LogConcDEAD 


Kernel (opt MISE) 


Kernel (LSCV) 


100 


0.00607(0.000283) 


0.00440 


0.00827(0.000583) 


500 


0.00168(0.0000573) 


0.00167 


0.00240(0.000122) 


1000 


0.00100(0.0000295) 


0.00108 


0.00142(0.0000662) 


2000 


0.000608(0.0000154) 


0.000700 


0.000868(0.0000331) 



n 

100 
500 
1000 
2000 



n 

100 
500 
1000 
2000 



n 

100 
500 
1000 
2000 



n 

100 
500 
1000 
2000 



(c) r(2, 1) (independent components) 



LogConcDEAD 

0.00588(0.000222) 
0.00143(0.0000478) 
0.000802(0.0000236) 
0.000451(0.0000110) 



Kernel (opt AMISE) 

0.00644 
0.00220 
0.00139 
0.000874 



Kernel (LSCV) 

0.00800(0.000339) 
0.00291(0.0000687) 
0.00194(0.0000456) 
0.00130(0.0000209) 



(d) Normal location mixture, 



LogConcDEAD 

0.00504(0.000206) 
0.00136(0.0000745) 
0.000747(0.0000622) 
0.000543(0.0000553) 



Kernel (opt MISE) 

0.00384 
0.00145 
0.000945 
0.000610 



(e) Normal location mixture, \\fi\\ 



LogConcDEAD 

0.00434(0.00158) 
0.000996(0.0000622) 
0.000640(0.0000502) 
0.000445(0.0000455) 



Kernel (opt MISE) 

0.00304 
0.00117 
0.000760 
0.000492 



(f) Normal location mixture, 



LogConcDEAD 

0.00467(0.000139) 

0.00173(0.0000522) 

0.00122(0.0000456) 

0.00105(0.0000340) 



Kernel (opt MISE) 

0.00326 
0.00126 
0.000819 
0.000530 



Kernel (LSCV) 

0.00515(0.000195) 
0.00179(0.0000515) 
0.00116(0.0000376) 
0.000683(0.0000121) 



Kernel (LSCV) 

0.00514(0.000322) 
0.00146(0.000442) 
0.000880(0.000176) 
0.000583(0.0000192) 



Kernel (LSCV) 

0.00484(0.000244) 
0.00150(0.000363) 
0.000925(0.0000131) 
0.000577(0.0000651) 
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Table 3: Mean integrated squared error estimates (with standard errors in brackets where applicable; 
d = 3) 

(a) Independent Normal 



n 


LogConcDEAD 


Kernel (opt MISE) 


xvernei v ) 


i nn 

1UU 


0.00426(0.000131) 


0.00240 


n nn^n^i'n nnn970A 


^nn 
ouu 


0.000835(0.0000302) 


0.00106 


n nnizfUn nnnn^t^s^ 

U.UUl^O^U.UUUUOOO ) 


1000 


0.000442(0.0000236) 


0.000737 


0.000888(0.0000139) 


2000 


0.000304(0.0000238) 


0.000508 


0.000579(0.00000985) 






(b) Dependent Normal 




n 


LogConcDEAD 


Kernel (opt MISE) 


jvei nei ^Loij v j 


1 nn 

1UU 


0.00467(0.000147) 


0.00254 


U.UUOOU^U.UUUoDl ) 


^nn 
ouu 


0.000812(0.0000301) 


0.00112 


n nni ^9(Ti nnnn^tf^ 

U.UUIOZ^U.UUUUOO / J 


1000 


0.000431(0.0000249) 


0.000778 


0.000922(0.0000145) 


2000 


0.000304(0.0000233) 


0.000537 


0.000603(0.00000676) 




(c) r(2, 1) (independent components) 




71 


LogConcDEAD 


Kernel (opt AMISE) 


jvei nei i v i 


1 nn 

1UU 


0.00365(0.000142) 


0.00344 


n n7/ii (r\ nn/inn^ 

U.U 1 41 ^U.UU4UU ) 


^nn 
ouu 


0.000779(0.0000243) 


0.00136 


u.uuiyz^u.uuuuoio ) 


1000 


0.000538(0.000104) 


0.000922 


0.00123(0.0000262) 


2000 


0.000292(0.0000414) 


0.000622 


0.000849(0.0000228) 




(d) Normal location mixture, = 1 




71 


LogConcDEAD 


Kernel (opt MISE) 


T/ DrnD l / 1 CPV^l 

j\.er nei ^Lov^ v j 


1 nn 


0.00395(0.000124) 


0.00214 


U.UU440^U.UUUZ4Z j 


^nn 
ouu 


0.000743(0.0000272) 


0.000946 


n nni o/i (c\ nnnnoocA 
u.uuiz4^u.uuuuzyo ) 


1 nnn 

1UUU 


0.000446(0.0000218) 


0.000656 


n nnn^99i'n nnnni7cn 

U.UUUOZZ^U.UUUUl / cf J 


2000 


0.000265(0.0000202) 


0.000452 


0.000508(0.00000537) 




(e) Normal location mixture, = 2 




n 


LogConcDEAD 


Kernel (opt MISE) 


Kernel (LSCV) 


100 


0.00319(0.000100) 


0.00168 


0.00371(0.000203) 


500 


0.000596(0.0000231) 


0.000748 


0.00103(0.0000340) 


1000 


0.000329(0.0000173) 


0.000520 


0.000656(0.0000160) 


2000 


0.000220(0.0000171) 


0.000358 


0.000410(0.00000519) 




(f) Normal location mixture, \\fi\\ = 3 




n 


LogConcDEAD 


Kernel (opt MISE) 


Kernel (LSCV) 


100 


0.00328(0.0000930) 


0.00166 


0.00296(0.000120) 


500 


0.000803(0.0000184) 


0.000751 


0.000998(0.000254) 


1000 


0.000552(0.0000169) 


0.000525 


0.000613(0.0000892) 


2000 


0.000401(0.0000133) 


0.000364 


0.000404(0.00000488) 



14 



likelihood estimator will converge to the density /* that minimises the Kullback-Leibler divergence 
d(fo II /) = / fo (x) log jjfe^ dx over / G J-q. Such a result would be interesting for robustness 
purposes, because it could be interpreted as saying that provided the underlying density does not 
violate the log-concavity assumption too seriously, the log-concave maximum likelihood estimator is 
still sensible. 



6 Clustering example 



In a recent paper. IChang and Waltherl (|2008f ) introduced an algorith m which combines the u nivariate 
log-concave maximum likelihood estimator with the EM algorithm (jDempster et all 119771 ). to fit a 
finite mixture density of the form 



p 



(6.1) 



where the mixture proportions %%, . . . ,ir p are positive and sum to one, and the component densities 
fx, . . .,f p are u nivariate and log-concave. The method is an extension of the standard Gaussian EM 
algorithm, e.g. iFralev and Raftervl t|2002h . which assumes that each component density is normal. 
Once estimates n\, . . . , ic p , fx, ■ ■ ■ , f v have been obtained, clustering can be carried out by assigning 
to the jth cluster those observations Xi for which j = argmax r 7r r / r (Xj). IChang and Walther 
( 20081 ) show empirically that in cases where the true component densities are log-concave but not 
normal, their algorithm tends to make considerably fewer misclassifications and have smaller mean 
absolute error in the mixture proportion estimates than the Gaussian EM algorithm, with very 
similar performance in cases where the true component densities are normal. 



Owing to the previous lack of an alg orithm for computing the m aximum likelihood estimator of a 
multidimensional log-concave density, IChang and Waltherl (|2008l ) discuss an extension of the model 
in (|6.1[) to a multivariate context where the univariate marginal densities of each component in 
the mixture are assumed to be log-concave, and the dependence structure within each component 
density is modelled with a normal copula. Now that we are able to compute the maximum likelihood 
estimator of a multidimensional log-concave density, we can carry this method through to its natural 
conclusion. That is, in the finite mixture model (|6.ip for a multidimensional log-concave density 
/, we simply assume that each of the component densities fx, ■ ■ . , f p is log-concave. An interesting 
problem that we d o not address here that of finding appropriate conditions under which this model 
is identifiable - see iTitterington et all (|1985l . Section 3.1) for a nice discussion. 



6.1 EM algorithm 



An introduction to the EM algorithm can be found in lMcLachlan and Krishnan (1997). Briefly, given 



current estimates of the mixture proportions and component densities tt^ 
the £th iteration of the algorithm, we update the estimates of the mixture proportions by setting 



fi> ^ a t 
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+1) = n 1 YT i= i §ij for 3 = 1> • • • ,P> where 



is the current estimate of the posterior probability that the ith observation belongs to the jth 
component. We then update the estimates of the component densities in turn using the algorithm 
described in Section [U choosing fj i+1 ^ to be the log-concave density fj that maximises 



E 



#3 log 



The incorporation of the weights 0\ j , . . . , 6 n ^ in the maximisation process presents no additional 
complication, as is easily seen by inspecting the proof of Theorem [2j As usual with methods based 
on the EM algorithm, although the likelihood increases at each iteration, there is no guarantee that 
the sequence converges to a global maximum. In fact, it can happen that the algorithm produces 
a sequence that approaches a degenerate solution, corresponding to a component concentrated on 
a single observation, so that the likelihood becomes arbitraril y high. The same i ssue c an arise 
when fitting mixtures of Gaussian densities, and in this context iFralev and Raftervl (I2002T ) suggest 



that a Bayesian approach can alleviate the problem in these instances by effectively smoothing the 
likelihood. In general, it is standard practice to restart the algorithm from different initial values, 
taking the solution with the highest likelihood. 



6.2 Breast cancer example 



We ill ustrate the log-concave EM algorithm on the Wisconsin breas t cancer data set of Street et al\ 



(1993), available on the UCI Machine Learning Repository website ( Asuncion and Newman . 20071 ) 



http: //archive . ics .uci . edu/ml/datasets/Breast+Cancer+Wisconsin+y,28Diagnostic/,29 

The data set was created by taking measurements from a digitised image of a fine needle aspirate of 
a breast mass, for each of 569 individuals, with 357 benign and 212 malignant instances. We study 
the problem of trying to diagnose (cluster) the individuals based on the standard errors of two of the 
measurements, namely the radius of the cell nucleus (mean of distances from center to points on the 
perimeter, X) and its texture (standard deviation of grey-scale values, Y). The data are presented in 
Figuredja). In fact, the full data set consists of 30 measurements for each patient, representing the 
mean, standard error and 'worst' (mean of the three largest values) of 10 different features computed 
for each cell nucleus in the image. Since one would reasonably expect the means of each feature to 
be approximately normally distributed, and hence the Gaussian EM algorithm to be appropriate, we 
took the standard errors of the first two measurements to illustrate the log-concave EM algorithm 
methodology. 

It is important also to note that although for this particular data set we do know whether a particular 
instance is benign or malignant, we did not use this information in fitting our mixture model. 
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(a) Data 



o Correct 
■ Incorrect 




575 To E5 57o 57eT 
X 

(c) Log-concave mixture classification 



Correct 
Incorrect 




575 Eo 375 57o 57s - 

X 

(b) Gaussian mixture classification 




(d) Estimated log-concave mixture 



Figure 4: Panel (a) plots the Wisconsin breast cancer data, with benign cases as solid squares 
and malignant ones as open circles. Panel (b) gives a contour plot together with the misclassified 
instances from the Gaussian EM algorithm, while the corresponding plot obtained from the log- 
concave EM algorithm is given in Panel (c). Panel (d) plots the fitted mixture distribution from the 
log-concave EM algorithm. 
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Instead this information was only used afterwards to assess the performance of the method, as 
reported below. Thus we are studying a clustering (or unsupervised learning) problem, by taking 
a classification (or supervised learning) data set and 'covering up the labels' until it comes to 
performance assessment. 

The skewness in the data suggests that the mixture of Gaussians model may be inadequate, and in 
Figure[?fb) we show the contour plot and misclassified instances from this model. The corresponding 
plot obtained from the log-concave EM algorithm is given in Figure Hfc) , while Figure E^d) plots 
the fitted mixture distribution from the log-concave EM algorithm. For this example, the number 
of misclassified instances is reduced from 144 with the Gaussian EM algorithm to 121 with the 
log-concave EM algorithm. 



In some examples, it will be necessary t o estimate p, the number of mixture components. In the 
general context of model-based clustering, Fraley and Rafteryl ( 2002 ) cite several possible ap proaches 
for this purpose, incl uding methods b ased on resampling ( McLachlan and Basford . 19881 ) and an 
information criterion (jBozdogan . 1994 ). Further research will be needed to ascertain which of these 
methods is most appropriate in the context of log-concave component densities. 



7 Plug-in estimation of functionals, sampling and the boot- 
strap 



Suppose X has density /. Often, we are less interested in estimating a density directly than in 
estimating some functional 0(f). Examples of functionals of interest (some of which were given in 
Section [lj, include: 



(a) P(||X|| > l) = ff(x)l {M > iy dx 

(b) Moments, such as E(X) = J xf(x) dx, or E(||X|| 2 ) = / \\x\\ 2 f(x) dx 

(c) The differential entropy of X (or /), defined by H(f) = — J f(x) \ogf(x) dx 

(d) The 100(1 - a)% highest density region, defined by R a = \x <£ R d : f( x) > /«,}, where f a 
is the largest constant such that ¥(X £ R a ) > 1 — a. iHvndman ( 19961 ) argues that this is 
an informative summary of a density; note that subject to a minor restriction on /, we have 
/ fi^Uix^U] dx=l-a. 



Each of these may be estimated by the corresponding functional 0(f n ) of the log-concave maximum 
likelihood estimator. In examples (a) and (b) above, 0(f) may also be written as a functional of the 
corresponding distribution function F, e.g. P(||X|| > 1) = J l/i| a .|i>i)dF(a;). In such cases, it is more 
natural to use the plug- in estimator based on the empirical distribution function, F n , of the sample 
Xi, . . . , X n , and indeed in our simulations we found that the log-concave plug-in estimator did not 
offer an improvement on this method. In the other examples, however, an empirical distribution 
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function plug-in estimator is not available, and the log-concave plug-in estimator is a potentially 
attractive procedure. 

7.1 Monte Carlo estimation of functionals and sampling from the density 
estimate 

For some functionals we can compute — 9(f n ) analytically. If this is not possible, but we can write 
0(f) = J f(x)g(x) dx, we may approximate by 

1 B 

b=l 

for some (large) B, where X*, . . . , Xg are independent samples from /„. Conditional on X\, . . . , X n , 
the strong law of large numbers gives that Ob — ► as B — > oo. In practice, even when analytic 
calculation of was possible, this method was found to be fast and accurate. 

In order to use this Monte Carlo procedure, we must be able to sample from f n . Fortunately, this 
can be done efficiently using the following rejection sampling procedure. As in Section dj for j G J 
let Aj be the d x d matrix whose Zth column is Xj t+1 — Xj t for I = l,...,d, and let aj — Xj t , 
so that w i— > Ajiv + aj maps the unit simplex to C n j. Recall that log f n (Xi) = y*, and let 
z i = ( Z JA> ■ ■ ■ i z j,d), where Zj t i = y* l+1 - y* n for I = 1, . . . , d. Write 

<7j = / fn(x) dx. 
JC n ,j 

We may then draw an observation X* from /„ as follows: 

(i) Select j* £ J, selecting j* — j with probability qj 

(ii) Select w ~ Unif(Td) and u ~ Unif([0, 1]) independently. If 

< cxp((w,z r )) 
max^eTd exp({v, zj*)) ' 

accept the point and set X* = Ajw + aj. Otherwise, repeat (ii). 

7.2 Simulation study 

In this section we illustrate some simple applications of this technique to functionals (c) and (d) 
above, using the Monte Carlo procedure and sampling scheme described in Section 17.11 Estimates 
are based on random samples from a ^(0, /) distribution, and we compare the performance of the 
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Table 4: (a) gives mean squared errors for estimating the differential entropy of the A^O,/) distri- 
bution; (b) gives E{fif(R a AR a )} when estimating highest density regions. The numbers in brackets 
are Monte Carlo standard errors. 

(a) Differential entropy 



100 
500 
1000 
2000 



n LogConcDEAD 

100 0.0761(0.00629) 

500 0.00819(0.000653) 

1000 0.00378(0.000391) 

2000 0.00177(0.000232) 



Kernel 

0.0457(0.00304) 
0.0137(0.000839) 
0.00716(0.000581) 
0.00427(0.000345) 



(b) 25%/50%/75% highest density regions 

LogConcDEAD Kernel 
0.0872(0.0024)/0.110(0.0033)/0. 121(0.0047) 
0.0419(0.0010) /0.0587(0.0014) /0.0680(0.0022) 
0.0311(0.00075)/0.0447(0.0011)/0.0536(0.0016) 
0.0241(0.00054)/0.0363(0.00080)/0.0448(0.0013) 



0.0753(0.0017) /0.0995(0.0028) /0.0959(0.0038) 
0.0467(0.0011)/0.0609(0.0013)/0.0637(0.0019) 
0.0376(0.00095) /0.0476(0.0012) /0.0477(0.0015) 
0.0322(0.00081)/0.0371(0.00098)/0.0399(0.0013) 



LogConcDEAD estimate with that of a kernel-based plug-in estimate, where the bandwidth matrix 
was chosen using our knowledge of the underlying density to minimise the MISE. 

Table SJa) gives mean squared errors (with Monte Carlo standard errors) of the plug-in estimates 
of the differential entropy. In Table [4fb) we study the plug- in estimators R a of the highest density 
region R a , and measure the quality of the estimation procedures through E{/i/(i? Q A R a )}, where 
Hf(A) = J A f(x)dx and A denotes set difference. Highest density regions can be computed once 
we have a pproximated the sample versions of f a using the density quantile algorithm described in 
Hvndmanl (|l996l . Section 3.2). 

For the differential entropy estimators, we find a similar pattern to that observed in Section [5] 
the log-concave plug-in estimator provides an improvement on the kernel-based estimator for the 
moderate and large sample sizes in our simulations. For the case of highest density regions, the 
relative performance of the log-concave estimator is better for the estimation of smaller density 
regions. In Figure 03 we illustrate the estimation of three highest density regions based on 500 
points from a N 2 (0,I) distribution. For comparison, a kernel-based plug-in estimate (where the 
regions are not guaranteed to be convex) is also given. 

In real data examples, we are unable to assess uncertainty in our functional estimates by taking 
repeated samples from the true underlying model. Nevertheless, the fact that we can sample from 
the log-concave maximum likelihood estimator does mean that we can apply standard bootstrap 
methodology to compute standard errors or confidence intervals, for example. Finally, we remark 
that the plug-in estimation procedure, sampling algorithm and bootstrap methodology extend in an 
obvious way to the case of a finite mixture of log-concave densities. 
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LogConcDEAD estimate. 500 N(0,l) observations 



Ttoe HDRs, 500 N(0,l) observations 





(c) Kernel estimate 



Figure 5: Estimates of the 25%, 50% and 75% highest density region from 500 observations from 
the -ZV 2 (0,I) distribution. 
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8 Concluding discussion 



We have developed methodology that gives a fully automatic nonparametric density estimate under 
the condition that the density is log-concave, and shown how it may be extended to fit finite mix- 
tures of log-concave densities. We have indicated a wide range of possible applications, including 
classification, clustering and functional estimation problems. The area of shape-constrained estima- 
tion is currently undergoing rapid growth, as evidenced by the many recent publications cited in the 
penultimate paragraph of Section [TJ as well as recent workshops in Oberwolfach (November 2006), 
Eindhoven (October 2007) and Bristol (November 2007). We hope that this paper will stimulate 
further interest and research in the field. 

As well as the continued development and refinement of the computational algorithms and graphical 
displays of estimates, and studies of theoretical performance, there remain many challenges and 
interesting directions for future research. These include: 

(i) Studying othe r shape constraint s. These have received some attention for univariate data, 
dating back to ICxrenanderl (jl956h . but much less in the multivariate setting. 



(ii) Developing both formal and informal diagnostic tools for assessing the validity of shape con- 
straints. 

(iii) Assessing the uncertainty in shape-constrained nonparametric density estimates, through con- 
fidence intervals/bands. 

(iv) Developing analogous methodology for discrete data from shape-constrained distributions. 

(v) Examining nonparametric shape constraints in regression problems. 

(vi) Studying methods for choosing the number of clusters in nonparametric, shape-constrained 
mixture models. 



Glossary of terms and results from convex analysis and 
computational geometry 



All of the definitions and results below can be found in iRockafellarl (|1997T ) and iLed (|1997I ). The 

epigraph of a function / : R fe — > [-co, oo) is the set 

epi(/) = {(x,fi) :xeR k ,fieR,fi< f(x)}. 

We say / is concave if its ep igraph is non-emp t y and convex as a subset of R fc+1 ; note that this 
agrees with the terminology of lBarndorff-Nielsenl (|1978T ). but is what IRockafellarl (|l997T ) calls a proper 
concave function. If C is a convex subset of M. k then provided / : C — > [-co, oo) is not identically 
— oo, it is concave if and only if 



f(tx+(l-t)y)>tf(x) + (l-t)f(y) 
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for x,y G C and i £ (0,1). A non-negative function / is log-concave if log / is concave, with the 
convention that logO = — co. The support of a log-concave function / is {x G R k : log/(x) > — co}, 
a convex subset of R fc . 

A subset M of R fc is affine if tx + (1 - t)y G M for all x, y G M and t € R. The c#ne ZiuZZ of 
M, denoted aff(M), is the smallest afhne set containing M. Every non-empty affine set M in R fe 
is parallel to a unique subspace of R* , meaning that there is a unique subspace L of R fc such that 
M = L + a, for some a G R*. The dimension of M is the dimension of this subspace, and more 
generally the dimension of a non-empty convex set is the dimension of its affine hull. A finite set of 
points M = {xq,xi, . . . ,Xd} is affinely independent if aff(M) is d-dimensional. The relative interior 
of a convex set C is the interior which results when we regard C as a subset of its affine hull. The 
relative boundary of C is the set difference between its closure and its relative interior. If M is an 
affine set in R fe , then an affine transformation (or afffine function) is a function T : M — > R fc such 
that T(tx + (1 - t)y) = tT(x) + (1 - t)T(y) for all x, y G M and t G R. 

The closure of a concave function g on R d , denoted cl(g), is the function whose epigraph is the closure 
in R d+1 of epi(g). It is the least upper semi-continuous, concave function satisfying c\(g) > g. The 
function g is closed if cl(g) — g. An arbitrary function h on R d is continuous relative to a subset 
S of R d if its restriction to S is a continuous function. A non-zero vector z G R d is a direction of 
increase of /i on R d if f /i(x + tz) is non-decreasing for every x G R d . 

The convex hull of finitely many points is called a polytope. The convex hull of d + 1 affinely 
independent points is called a d-dimensional simplex (pi. simplices). If C is a convex set in R d , 
then a supporting half-space to C is a closed half-space which contains C and has a point of C in its 
boundary. A supporting hyperplane H to C is a hyperplane which is the boundary of a supporting 
half-space to C. Thus H = {x G R d : (x, b) = /?}, for some 6 G R d and /? G R such that (x, b) < [3 
for all x G C with equality for at least one x G C. 

If F is a finite set of points in R d such that P = conv(t^) is a d-dimensional polytope in R d , then a 
/ace of P is a set of the form P C\H , where H is a supporting hyperplane to P. The vertex set of P, 
denoted vert(P), is the set of 0-dimensional faces (vertices) of P. A subdivision of P is a finite set 
of d-dimensional polytopes {Si, . . . , S t } such that P is the union of Si, . . . , St and the intersection 
of any two distinct polytopes in the subdivision is a face of both of them. If S = {S\, . . . , St} and 
S = {Si, . . . ,S t '} are two subdivisions of P, then S is a refinement of S if each Si is contained in 
some Si'. The trivial subdivision of P is {P}. A triangulation of P is a subdivision of P in which 
each polytope is a simplex. 

If P is a d-dimensional polytope in R d , P is a (d — l)-dimensional face of P and v G R d , then there 
is a unique supporting hyperplane H to P containing F. The polytope P is contained in exactly 
one of the closed half-spaces determined by H, and if v is in the opposite open half-space, then F is 
visible from v. If V is a finite set in R d such that P = conv(F), if v G V and S = {Si, . . . , S t } is a 
subdivision of P, then the result of pushing v is the subdivision S of P obtained by modifying each 
Si G S as follows: 

(i) If v i Si, then Si € S 
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(ii) If v G Si and conv(vert(iSj) \ {v}) is (d — l)-dimensional, then Si G S 

(iii) If v G Si and St = conv(vert(5;) \ {v}) is <i-dimensional, then St G S. Also, if F is any 
(d — l)-dimensional face of St that is visible from v, then conv(F U {v}) G S. 

If cr is a convex function on K™, then y' G R n is a subgradient of cr at y if 

ct(z) > o-(y) + (y',z- y) 

for all 2 G M™. If a is differentiable at y. then Vcr(y) is the unique subgradient to a at y; otherwise 
the set of subgradients at y has more than one element. The one-sided directional derivative of cr at 
y with respect to z G K™ is 

// x r o-{y + tz)-cr(y) 
a(y;z) = hm , 

which always exists (allowing -co and oo as limits) provided a(y) is finite. 



B Proofs 



Proof of Proposition Q] 

(a) If / is log-concave, then for x G M. d , we can write 

fx\p v (X)(x\t) oc f{x)l { p v(x)=t} , 
a product of log-concave functions. Thus fx\p v (x){'\t) is log-concave for each t. 

(b) Let x\, X2 G K d be distinct and let A G (0, 1). Let V be the (d — l)-dimensional subspace of M. d 
whose orthogonal complement is parallel to the affine hull of {xi,^} (he. the line through xi and 
x,2). Writing fp v (x) f° r the marginal density of Py(X) and t for the common value of Pv{x\) and 
Pv{zi), the density of X at x G M d is 

/fa) = fx\Pv(X){x\t)fp v{X ){t)- 

Thus / is log-concave, as required. □ 
Completion of the Proof of Theorem 

We prove each of the steps (i)-(v) outline d in Section [5 in tu rn. First note that if io £ C„, then by 
Caratheodory's theorem (Theorem 17.1 of iRockafellarl (1997J))j there exist distinct indices ..,i r 
with r < d+ 1, such that xq = Y)t—i AjX^ with each A; > and 531=1 ^ l = ^ Thus, if /(xq) = 0, 
then by Jensen's inequality, 

r 

-oo = log/(x ) > ^2 Az log/^J, 
so f(Xi) = for some i. But then ipn(f) = — oo. This proves (i). 
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Now suppose f(xo) > for some xq C n . Then {x : f(x) > 0} is a convex set containing C n U {20}, 
a set which has strictly larger d-dimensional Lebesgue measure than that of C n . We therefore have 
V>n(/) < ^n(/lcj, which proves (ii). 

To prove (hi), we first show that log / is closed. Suppose that log f{Xj) — yi for i — 1, . . . , n but that 
log / ^ h y . Then since \ogf(x) > h y (x) for all x G R d , we may assume that there exists xq G C n 
such that log/(a;o) > h y (xo). If xq i s in the relative i nterior of C n , then since log/ and are 
continuous at xq (by Theorem 10.1 of lRockafellan (Il997h 'l. we must have 

V>«(/) < ^n(exp(/i a )). 

The only remainin g possibility is tha t xq is on the relative boundary of C„. But /i y is closed by 
Corollary 17.2.1 of iRockafellarl ( 19971 ). so writing cl(g) for the closure of a c oncave function q we 
have /i y = c\(h y ) = cl(log/) > log/, where we have used Corollary 7.3.4 of Rockafellarl (19971) to 
obtain the middle equality. It follows that log / is closed and log / = which proves (hi). 

Note that log / has no direction of increase, because if x G C n , z is a non-zero vector and t > is 
large enough that x + tz ^ C n , then — oo = log/(x + tz) < log/(x). It follows by Theorem 27.2 of 
Rockafellarl (|l997h that the supremum of / is finite (and is attained). Using properties (i) and (ii) 



as well, we may write / f(x)dx = c, say, where c G (0, oo). Thus f(x) — cf{x), for some / G !Fq. 
But then 

Mf) - Mf) = -i - io gc + c > o, 

with equality only if c = 1. This proves (iv). 

To prove (v), we may assume by (iv) that exp(/i y ) is a density. Let maxj/i y (Xi) = M and let 
mini hy(Xi) — m. We show that when M is large, in order for exp(/i y ) to be a density, m must be 
negative with \m\ so large that ^„(exp(/i a )) < ijj n (f). First observe that if x G C n and h y (Xi) = M, 
then for M sufficiently large we must have M — m > 1, and then 



_ / \ \ \ M — m — 1 - 

m (M - m - 1)M 

> T7 + " 77 — = M — I. 

M — m M — m 

(The fact that h y (x) > m follows by Jensen's inequality.) Hence, denoting Lebesgue measure on R d 
by /X, we have 

M x : hy { x) > M - 1}) > ,({ Xi + ^(C n - X,)}) = J 0± ¥ . 

Thus 

Af-l /"(Cn) 



/ exp{h y (x)} dx > 



(M - m) d ' 

For exp(/ij,) to be a density, then, we require m < — ie( M-1 )/ £i ^(C n ) 1 / d when M is large. But then 

Vn(exp(M) < (n ~ 1)M - J- e (^-D/V(C„) 1 / d < V„(/) 
v ' n 2n 
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when M is sufficiently large. This proves (v). 



It is not hard to see that for any M > 0, the function y i— > ip n (exp(h y )) is continuous on the compact 
set [-M, M] n , and thus the proof of the existence of a maximum likelihood estimator is complete. 
To prove uniqueness, suppose that /i,/2 G T and both fa and fa maximise ip n (f)- We may assume 
fx, fa £ -^o, log/iifog/2 G W and /1 and /2 are supported on C„. Then the normalised geometric 
mean 

{/ito/aQr)} 1 ' 2 

W /c„{/i(y)/2(2/)} 1/2 ^' 

is a log-concave density, with 

^ n 1 n /" 

^»Grt = ^y>g/i(*i) + ^y>g/a(*i)-log / {/i(«)My)} 1/3 %-i 

Z71 //^ 
z=l i=l 

= V n (/i)-log/ {A(y)/ 2 (y)} 1/2 d2/. 

•/C„ 

However, by Cauchy-Schwarz, J Cn {fa(y)fa(y)} 1/2 dy < 1, so t/> n (g>) > ip„{fi). Equality is ob- 
tained if and onl y if /1 = / 2 a lmost everywhere, but since /1 and fa are continuous relative to C ra 
(Theorem 10.2 of iRockafellar (jl997l )). this implies that fi — fa. An alternative way of proving the 



uniqueness of the maximum likelihood estimator may be based on the fact that tp n (tfi + (l — t)fa) > 
ttpn(fi) + (1 — t)ipn(fa) for all t G (0, 1), provided fa and fa are distinct elements of T . □ 

Proof of Theorem [3] 

For t G (0, 1) and y^ x \y^ G R™, the function h ty d) + H_ t \ y w is the least concave function satisfying 

h ty{ i )+(1 _ t)yi 2)(X t ) > tyf> + (1 - t)yf> for i = 1, ... ,71, so h ty( i )+(1 _ t)y( 2) < th y m + (1 - t)h y m- 
The convexity of a follows from this and the convexity of the exponential function. It is clear that 
cr > r, since h y (Xj) > yi for i = 1, . . . , n. 

From Theorem[2j we can find y* G R™ such that log/„ = /ij,* with /ij,* (Xi) = y| for i = 1, . . . , n, and 
this y* minimises r. For any other y G R™ which minimises r, by the uniqueness part of Theorem [5] 
we must have h y — h y *, so a(y) > a(y*) — r(y*). □ 



B.l Non-differentiability of a and computation of subgradients 

In this section, we find explicitly the set of points at which the function a defined in (|4.1[) is 
diffcrcntiable, and compute a subgradient of a at each point. For i = 1, . . . , n, define 

Ji = {j = (ji, . . . , j d+ i) G J :i= ji for some I = 1, . . . , d + 1}. 

The set Ji is the index set of those simplices C nj that have Xi as a vertex. Let y denote the set 
of vectors y = [y\,... ,y n ) G R™ with the property that for each j — (ji, ... , jd+i) G J, if i ^ ji for 
any I then 

{ (Xi, yi), (X h ,y h ),..., (X jd+1 , } 



2G 



is affinely independent in This is the set of points for which no tent pole is touching but not 

critically supporting the tent. Notice that the complement of y has zero Lebesgue measure in R n . 
For y £ R" and i= 1, . . . , n, and in the notation of Section [U let 



! 1 < :i = ~ + E I det A i I / e (w ' Zi)+Vh { (l - E w ) Mn =*} + E w < } 

ieJ, Td i=i i=i 



Proposition 4. Assume (Al). (a) For y £ y, the function a is differ entiable at y and for i = 
1, . . . , n satisfies 

^(y) = d i (y). 
oyt 

(b) For y £ y c , the function a is not differ entiable at y, but the vector (d\(y), . . . , d n (y)) is a 
subgradient of a at y. 



Proof. By Theorem 25.2 of iRockafellarl (|1997l ). it suffices to show that for y £ y, all of the partial 
derivatives exist and are given by the expression in the statement of the proposition. For i = 1, . . . , n 
and let j/W = y + ie", where e™ denotes the ith unit coordinate vector in W 1 . For sufficiently 

small values of |t|, we may write 

z / \ f (x, by) — By if x £ C„ ,■ for some j 6 J 
^ — oo if a; f. C n , 

for certain values of bf , . . . , b$ £ R d and Bf , . . . , [3$ £ R. If j <£ J h then bf = b 3 and Bf = Bj 
for sufficiently small \t\. On the other hand, if j G Jj, then there are two cases to consider: 

(i) If ji — i, then for sufficiently small t, we have zf = Zj — tl^, where 1^ denotes a d-vector of 
ones, so that bf = bj - t(Aj)- 1 l d and Bf = Bj - t(l + (Aj 1 ^, l d )) 

(ii) If ji+i = i for some I £ {1, ... , d}, then for sufficiently small t, we have zf = Zj + tef, so that 
bf = bj + t(Ap" 1 ef and Bf = Bj + t(Af aj ,ef). 
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where to obtain the final line we have made the substitution x = Ajw + atj, after taking the limit 
as t — > 0. 

(b) If y G y c , then it can be shown that there exists a unit coordinate vector e™ in R" such that 
the one-sided directional derivative at y with respect to e™, denoted cr'(y;e™), satisfies cr'(j/;e") > 
—o-'(y; — e"). Thus er is not differentiable at y. To show that d(jj) = (d\(y), . . . , d n {y)) is a subgra- 
dient of <r at y, it is enough by Theorem 25.6 of [Rockafcllar ( 19971 ) to find, for each e > 0, a point 



jjel" such that \\y — y\\ < e and such that a is differentiable at y with ||Vcr(y) — d(y)\\ < e. This 
can be done by sequentially making small adjustments to the components of y in the same order as 
that in which the vertices were pushed in constructing the triangulation. □ □ 



A subgradient of a at any y S K™ may be computed using Proposition [H (|B.1|) and (|4.2[) once we 
have a formula for 



Id,u[z) = J w u exp z r w^] dw, 



when zi,...,Zd are non-zero and distinct. In ICule et ali (|2008b). it is shown that the required 
formula is 



~^—^ Z r (z r Z u ) [Z r Zg) ~^—^ Z r (z r Z u ) ^\ {z r Zg) 

l<r<d Ks<d Kr<d Ks<d 



s^r 



(-l)ci(e^ _ i) e *« 



Z u Y\ s =l 



n 



i 



Zu .^ J \, ( z u z s) 

l<s<d 



(B.l) 
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